<character> (UCS transformation format 8) An ASCII-compatible
multibyte Unicode and UCS encoding, used by Java and Plan 9.
The Unicode character set occupies a 16-bit code space. The
most obvious Unicode encoding (known as UCS-2) consists of a
sequence of 16-bit words. Such strings can contain bytes like
'\0' or '/' which have a special meaning in filenames and
other C library function parameters. In addition, the
majority of Unix tools expect ASCII files and can't read
16-bit words as characters without major modifications. For
these reasons, UCS-2 is not a suitable external encoding of
Unicode in filenames, text files, environment variables, etc.
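The byte-level problem described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: Python's "utf-16-be" codec stands in for big-endian UCS-2 (the two agree for the 16-bit characters used here), and the helper names ucs2_bytes and troublesome_bytes are hypothetical.

```python
# Assumption: "utf-16-be" is used as a stand-in for big-endian UCS-2.
# Both produce the same bytes for characters in the 16-bit range.

def ucs2_bytes(text):
    """Encode text as a sequence of big-endian 16-bit words."""
    return text.encode("utf-16-be")


def troublesome_bytes(data):
    """Return the offsets of NUL (0x00) and '/' (0x2F) bytes, which
    C strings and Unix pathnames treat specially."""
    return [i for i, b in enumerate(data) if b in (0x00, 0x2F)]


# A plain ASCII letter becomes 0x00 0x41: the first byte is NUL.
letter_a = ucs2_bytes("A")

# U+012F becomes 0x01 0x2F: the second byte is the same as ASCII '/'.
i_ogonek = ucs2_bytes("\u012f")

print(letter_a, troublesome_bytes(letter_a))
print(i_ogonek, troublesome_bytes(i_ogonek))
```

A C function that takes a NUL-terminated string would stop at the first byte of the encoded "A", and a pathname parser would see a directory separator inside the encoded U+012F.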
The ISO 10646 Universal Character Set (UCS), a superset of
Unicode, occupies a 31-bit code space and the obvious UCS-4
encoding for it (a sequence of 32-bit words) has the same
problems.
The UTF-8 encoding of Unicode and UCS avoids the problems of
fixed-length Unicode encodings because an ASCII file encoded
in UTF-8 is exactly the same as the original ASCII file, and
every byte of a non-ASCII character is guaranteed to have the
most significant bit set (bit 0x80). This means that normal
tools for text searching etc. work as expected.
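The two properties just described can be checked with a small Python sketch. It relies only on the built-in "utf-8" codec; the function name check_utf8_ascii_compat and the sample string are hypothetical and chosen purely for illustration.

```python
def check_utf8_ascii_compat(text):
    """Return True if every ASCII character encodes to its own single
    byte and every byte of each non-ASCII character has bit 0x80 set."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # ASCII characters must be unchanged by the encoding.
            if encoded != bytes([ord(ch)]):
                return False
        elif not all(b & 0x80 for b in encoded):
            # No byte of a multibyte sequence may look like ASCII.
            return False
    return True


# Hypothetical sample mixing ASCII, a Latin-1 letter, '/' and the euro sign.
sample = "na\u00efve/\u20ac"

print(sample.encode("utf-8"))
print(check_utf8_ascii_compat(sample))
```

Because no byte of a multibyte sequence falls in the ASCII range, a byte-oriented search for '/' or '\0' in UTF-8 data can only match a real '/' or NUL character.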
UTF-8 is defined in RFC 2279.
["File System Safe UCS Transformation Format (FSS_UTF)",
X/Open Preliminary Specification, X/Open Company Ltd.,
Document Number: P316. This information also appears in
ISO/IEC 10646, Annex P].
{Plan 9 UTF manual entry
(ftp://ftp.uu.net/doc/obi/Bell.Labs/plan9pm/09utf.ps.Z)}.
(1998-07-29)